Random Forest in Machine Learning

Foresight from the Forest

Forest Foresight (Advisor: Dr. Seals)

November 14, 2024

Random Forest in Machine Learning: Foresight from the Forest

A Random Forest Guided Tour

by Gérard Biau and Erwan Scornet [1]

  • Origin & Success: Introduced by Breiman (2001) [2], Random Forests excel in classification/regression, combining decision trees for strong performance.
  • Versatility: Effective for large-scale tasks, adaptable, and highlights important features across various domains.
  • Ease of Use: Simple to apply with minimal tuning; handles small samples and high-dimensional data.
  • Theoretical Gaps: Limited theoretical insights; known for complexity and black-box nature.
  • Key Mechanisms: Uses bagging and CART-split criteria for robust performance, though hard to analyze rigorously.

Tree Prediction

Each tree estimates the response at point \(x\) as:

\[ m_n(x; \Theta_j, D_n) = \frac{\sum_{i \in D_n(\Theta_j)} \mathbf{1}_{X_i \in A_n(x; \Theta_j, D_n)} Y_i}{N_n(x; \Theta_j, D_n)} \]

  • \(D_n(\Theta_j)\) is the resampled data subset,
  • \(A_n(x; \Theta_j, D_n)\) is the cell containing \(x\), and
  • \(N_n(x; \Theta_j, D_n)\) is the count of points in the cell
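This cell-average estimate can be checked directly on a fitted tree. A minimal sketch using scikit-learn and synthetic data (the toy features and response below are assumptions for illustration, not the project data): the tree's prediction at \(x\) equals the mean response over the training points that fall in the same leaf cell.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))               # synthetic features (illustrative only)
y = X[:, 0] + 0.1 * rng.normal(size=200)     # synthetic response

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

x = np.array([[0.5, 0.5]])
leaf = tree.apply(x)[0]                      # index of the cell A_n(x) containing x
in_cell = tree.apply(X) == leaf              # indicator 1_{X_i in A_n(x)}
cell_average = y[in_cell].mean()             # sum of Y_i in the cell / N_n(x)

# The tree's prediction is exactly the average response over its cell:
assert np.isclose(cell_average, tree.predict(x)[0])
```

Here `tree.apply` maps points to leaf indices, so the leaf membership mask plays the role of \(\mathbf{1}_{X_i \in A_n(x)}\) in the formula.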

Forest Prediction

The forest estimate for \(M\) trees is:

\[ m_{M, n}(x) = \frac{1}{M} \sum_{j=1}^{M} m_n(x; \Theta_j, D_n) \]

  • \(M\) is the total number of trees
  • \(m_n(x; \Theta_j, D_n)\) represents the prediction from each tree, and
  • the forest average yields the final prediction.
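The forest-level formula is likewise easy to verify: with scikit-learn's `RandomForestRegressor`, the ensemble prediction is the plain average of the \(M\) individual tree predictions. A small sketch on synthetic data (the data and \(M = 25\) are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))               # synthetic features (illustrative only)
y = X[:, 0] + 0.1 * rng.normal(size=200)

M = 25                                       # number of trees in the forest
forest = RandomForestRegressor(n_estimators=M, random_state=0).fit(X, y)

x = np.array([[0.5, 0.5]])
# m_n(x; Theta_j, D_n) for each tree j, each fit on its own bootstrap resample:
per_tree = np.array([t.predict(x)[0] for t in forest.estimators_])

# The forest estimate m_{M,n}(x) is the average of the M tree estimates:
assert np.isclose(per_tree.mean(), forest.predict(x)[0])
```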

Random Forest Regression

Splitting Criteria:

  • The CART-split criterion selects the best cut \((j, z)\) by maximizing:

\[ L_{\text{reg},n}(j, z) = \frac{1}{N_n(A)} \sum_{i=1}^{n} (Y_i - \bar{Y}_A)^2 \mathbf{1}_{X_i \in A} - \frac{1}{N_n(A)} \left(\sum_{i=1}^{n} (Y_i - \bar{Y}_{A_L})^2 \mathbf{1}_{X_i \in A_L} + \sum_{i=1}^{n} (Y_i - \bar{Y}_{A_R})^2 \mathbf{1}_{X_i \in A_R}\right) \]

  • \(N_n(A)\): Number of data points in cell \(A\).
  • \(Y_i\): Response variable for observation \(i\).
  • \(\bar{Y}_A\): Mean of \(Y_i\) in cell \(A\).
  • \(A_L\) and \(A_R\): Left and right child cells after the split.
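The criterion is the drop in within-cell variance achieved by a cut. A minimal numpy sketch, on synthetic one-dimensional data with a step at 0.5 (the data and the midpoint candidate scan are assumptions for illustration):

```python
import numpy as np

def cart_split_gain(x_col, y, z):
    """L_reg,n for cutting the current cell on one coordinate at threshold z:
    total squared deviation minus the left/right within-child squared deviations,
    normalized by the number of points in the cell."""
    n = len(y)
    left, right = x_col < z, x_col >= z
    total = np.sum((y - y.mean()) ** 2)
    within = np.sum((y[left] - y[left].mean()) ** 2) if left.any() else 0.0
    within += np.sum((y[right] - y[right].mean()) ** 2) if right.any() else 0.0
    return (total - within) / n              # maximized over candidate cuts (j, z)

rng = np.random.default_rng(1)
x_col = rng.uniform(size=100)
y = (x_col > 0.5).astype(float) + 0.05 * rng.normal(size=100)

# Scan midpoints between consecutive sorted sample values for the best cut:
xs = np.sort(x_col)
cuts = (xs[:-1] + xs[1:]) / 2
best = cuts[np.argmax([cart_split_gain(x_col, y, z) for z in cuts])]
```

On this data the best cut lands near the true change point at 0.5, since that split removes nearly all of the within-cell variance.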

Stopping Condition:

Nodes are not split if they contain fewer than nodesize points or if all \(X_i\) in the node are identical.


Prediction:

\[ m_{M, n}(x) = \frac{1}{M} \sum_{j=1}^{M} m_n(x; \Theta_j, D_n) \]

  • \(M\): Total number of trees in the forest.
  • \(m_n(x; \Theta_j, D_n)\): Prediction from the \(j\)-th tree.

Random Forest Classification

Splitting Criteria:

  • The Gini impurity measure is used to determine the best split:

\[ G = 1 - \sum_{k=1}^{K} p_k^2 \]

  • \(p_k\) represents the proportion of samples of class \(k\) in the node.
  • \(K\) is the number of classes.
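The Gini measure is a one-liner in practice. A small sketch (generic implementation, not tied to any particular library's internals):

```python
import numpy as np

def gini(labels):
    """Gini impurity G = 1 - sum_k p_k^2 for the class labels in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

assert gini([0, 0, 0, 0]) == 0.0             # pure node: impurity 0
assert np.isclose(gini([0, 1]), 0.5)         # 50/50 binary node: 1 - 0.25 - 0.25
```

A split's quality is then the parent's impurity minus the weighted impurities of the two children, analogous to the regression criterion above.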

Prediction:

  • Each tree makes a prediction using the majority class in the cell containing \(x\).
  • Classification uses a majority vote:

\[ m_{M, n}(x; \Theta_1, \ldots, \Theta_M, D_n) = \begin{cases} 1 & \text{if } \frac{1}{M} \sum_{j=1}^{M} m_n(x; \Theta_j, D_n) > \frac{1}{2} \\ 0 & \text{otherwise} \end{cases} \]

  • \(m_n(x; \Theta_j, D_n)\): Prediction from the \(j\)-th tree.
  • \(M\): Total number of trees in the forest.
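The vote can be sketched on a fitted classifier with synthetic data (the data and \(M = 51\) are assumptions for illustration). One caveat: scikit-learn's `RandomForestClassifier.predict` averages class probabilities (soft voting) rather than taking a hard majority of tree labels, but on a clear-cut point the two agree.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)      # synthetic binary labels

forest = RandomForestClassifier(n_estimators=51, random_state=0).fit(X, y)

x = np.array([[0.9, 0.8]])                   # a point well inside the class-1 region
votes = np.array([int(t.predict(x)[0]) for t in forest.estimators_])

# Class 1 wins when more than half of the M trees vote 1:
majority = int(votes.mean() > 0.5)
```

Using an odd \(M\) (here 51) avoids exact ties in the binary vote.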

Like this… [3]

The Data

  1. Where the data came from
  2. Size of data
  3. Key variables - Demographic, Behavioral, Seasonal
  4. Preprocessing needed

SSI Sales Data

                                             Total Orders    Closed Short    Fulfilled
                                             (n=7585)        (n=733)         (n=6852)
Top Customers
  Smoothie Island                            1701 (22.43%)   455 (62.07%)    1246 (18.18%)
  Philly Bite                                1556 (20.51%)   267 (36.43%)    1289 (18.81%)
  PlatePioneers                              1396 (18.40%)   143 (19.51%)    1253 (18.29%)
  Berl Company                                906 (11.94%)     5 (0.68%)      901 (13.15%)
  DineLink Intl                               589 (7.77%)     42 (5.73%)      547 (7.98%)
Top Products
  DC-01 (Drink carrier)                      1135 (14.96%)   345 (47.07%)     790 (11.53%)
  TSC-PQB-01 (Paper Quesadilla Clamshell)    1087 (14.33%)   389 (53.07%)     698 (10.19%)
  TSC-PW14X16-01 (1-Ply Paper Wrapper)        848 (11.18%)   283 (38.61%)     565 (8.25%)
  CMI-PCK-01 (Wrapped Plastic Cutlery Kit)    802 (10.57%)   288 (39.29%)     514 (7.50%)
  PC-05-B1 (Black 5oz Container)              745 (9.82%)    220 (30.01%)     525 (7.66%)

Sales over Time

Analysis - Stutti

Predicting Customer Churn

  • A binary churn indicator (0/1) was derived from the Last Sales Date.
  • Predictors: Class, Product, Qty Ordered, and Date Fulfilled.
  • The model was evaluated using statistics from the Confusion Matrix.
  • 80% Accuracy achieved:
    • Sensitivity: The model correctly identifies 78.6% of the actual 0 cases.
    • Specificity: The model correctly identifies 88.12% of the actual 1 cases.
    • Negative Predictive Value (NPV, class 1): When the model predicts 1, it is correct only 47.62% of the time; this lower NPV suggests the model is missing some 1 cases.
    • McNemar’s Test p-value (<2e-16): Indicates a statistically significant asymmetry in how the two classes are misclassified.
  • Conclusion: Overall, the model balances the two classes well (balanced accuracy 0.8336), though it is better at predicting class 0.
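These metrics all come from a 2x2 confusion matrix. A minimal sketch with class 0 treated as the positive class, as in the bullets above; the counts below are hypothetical, chosen only to roughly reproduce the reported rates, and are not the churn model's actual matrix:

```python
import numpy as np

# Hypothetical confusion matrix (rows = actual class, cols = predicted class):
cm = np.array([[720, 196],   # actual 0: 720 predicted 0, 196 predicted 1
               [ 24, 178]])  # actual 1:  24 predicted 0, 178 predicted 1

accuracy     = cm.trace() / cm.sum()
sensitivity  = cm[0, 0] / cm[0].sum()      # share of actual 0s identified (~0.786)
specificity  = cm[1, 1] / cm[1].sum()      # share of actual 1s identified (~0.8812)
npv_class1   = cm[1, 1] / cm[:, 1].sum()   # correctness when predicting 1 (~0.476)
balanced_acc = (sensitivity + specificity) / 2
```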

Analysis - Matt

Confusion Matrix

  • 0: Non-high-revenue OPCO
  • 1: High-revenue OPCO

Model Statistics

Metric              Value
Accuracy            0.956
95% CI              (0.951, 0.96)
Kappa               0.73
Sensitivity         0.66256
Specificity         0.98911
Pos Pred Value      0.87297
Neg Pred Value      0.96291
Prevalence          0.10146
Detection Rate      0.06722
Balanced Accuracy   0.82583

ROC Curve Analysis

Figure 1: ROC Curve for High Revenue Prediction

Feature Importance

Figure 2: Feature Importance for High Revenue Prediction

Analysis - Mika

Random Forest Model Summary

  • The Random Forest model was trained to predict QuantityFulfilled using 100 records of sales data.
  • The model used 8 predictor variables and 100 bootstrap iterations.
  • Bootstrap validation was performed to assess model stability and performance.
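The bootstrap validation loop can be sketched as follows: resample the rows with replacement, refit the forest, and score on the out-of-bag rows, then take percentiles of the resulting MSEs for a 95% interval. The data here is synthetic (100 rows, 8 predictors, matching the slide's shape but not its values), and the tree count per forest is an assumption:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 100                                       # 100 records, as in the summary
X = rng.uniform(size=(n, 8))                  # 8 predictors (synthetic stand-in)
y = 50 * X[:, 0] + 30 * X[:, 1] + 5 * rng.normal(size=n)

mses = []
for b in range(100):                          # 100 bootstrap iterations
    idx = rng.integers(0, n, size=n)          # resample rows with replacement
    oob = np.setdiff1d(np.arange(n), idx)     # evaluate on the out-of-bag rows
    model = RandomForestRegressor(n_estimators=30, random_state=b)  # tree count assumed
    model.fit(X[idx], y[idx])
    mses.append(mean_squared_error(y[oob], model.predict(X[oob])))

lo, hi = np.percentile(mses, [2.5, 97.5])     # 95% percentile CI for the MSE
```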

Model Performance Metrics

  • Original MSE: 800.89 units²
  • RMSE: 28.30 units (shows average prediction error in original units)
  • MAE: 13.09 units (shows average absolute prediction error)
  • Bias: −204.04 units² (the bootstrap MSE estimates average below the original MSE)
  • Standard Error: 292.65 units² (spread of the bootstrap MSE estimates)
  • 95% Confidence Interval: (198.3, 1449.2) for the MSE

Remarks

  • The model predicts QuantityFulfilled with an average error of about 28 units (RMSE).
  • Predictions are typically off by about 13 units (MAE).
  • The negative bootstrap bias (−204.04) means the bootstrap MSE estimates run systematically below the original MSE.
  • A wide confidence interval for the MSE (198.3 to 1449.2) indicates high variability in the error estimate.
  • Numerical variables (qtyOrdered, TotalPrice) are substantially more important than categorical ones.

Conclusion

  • High Performance & Versatile: Robust, accurate, handles noisy/high-dimensional data. [1]
  • Ensemble Strength: Averaging over many trees; identifies key variables. [1]
  • Challenges: Limited theoretical understanding; complex interpretation. [1]
  • Future Focus: Enhance theory, increase interpretability, broaden applications. [1]

References

[1] G. Biau and E. Scornet, “A random forest guided tour,” TEST, vol. 25, no. 2, pp. 197–227, Jun. 2016.
[2] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[3] Y. Fu, “Combination of random forests and neural networks in social lending,” Journal of Financial Risk Management, vol. 6, no. 4, pp. 418–426, 2017, doi: 10.4236/jfrm.2017.64030.